Using R and Longitudinal Data System Records to Answer Policy Questions

Jared Knowles

Overview

Why R?

R works for government

R

Google Scholar Hits

R has recently passed Stata on Google Scholar hits and it is catching up to the two major players SPSS and SAS

R Has an Active Web Presence

R is linked to from more and more sites

R Extensions

These links come from the explosion of add-on packages to R

R Has an Active Community

Usage of the R listserv for help has really exploded recently

R Examples

Read in Data

studat <- read.csv("data/smalldata.csv")
str(studat[, 24:32])
## 'data.frame':    2700 obs. of  9 variables:
##  $ schoolscore: num  29.2 56 56 56 56 ...
##  $ district   : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ schoolhigh : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ schoolavg  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ schoollow  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ readSS     : num  357 264 370 347 373 ...
##  $ mathSS     : num  387 303 365 344 441 ...
##  $ proflvl    : Factor w/ 4 levels "advanced","basic",..: 2 3 2 2 2 4 4 4 3 2 ...
##  $ race       : Factor w/ 5 levels "A","B","H","I",..: 2 2 2 2 2 2 2 2 2 2 ...
# source('data/simulate_data.R')

Simple Diagnostics

source("ggplot2themes.R")
library(ggplot2)
qplot(readSS, mathSS, data = studat, alpha = I(0.2)) + geom_smooth(aes(group = ell, 
    color = factor(ell))) + theme_dpi()
plot of chunk unnamed-chunk-1

plot of chunk unnamed-chunk-1

Advanced Diagnostics

samp <- sample(studat$stuid, 24)
plotsub <- subset(studat, stuid %in% samp)
qplot(grade, readSS, data = plotsub) + facet_wrap(~stuid, nrow = 4, 
    ncol = 6) + theme_dpi() + geom_line() + geom_smooth(method = "lm", se = FALSE)
plot of chunk unnamed-chunk-2

plot of chunk unnamed-chunk-2

More Advanced

Can this generate information?

  1. R can do advanced analytics that provide insight
  2. R can graphically depict those analytics in simple ways that are intuitive to policy makers
  1. BLBC study in Wisconsin
  2. Regression Trees
  3. Machine Learning Algorithms

BLBC in Wisconsin

Evaluations of Policy

Results I

Results II

Conclusions and Next Steps

Next Steps

Inference Trees

Inference Tree Example

library(partykit)
mypar <- ctree_control(testtype = "Bonferroni", mincriterion = 0.99)
mytree <- ctree(mathSS ~ race + econ + ell + disab + sch_fay + dist_fay + 
    attday + readSS, data = subset(studat, grade == 3))
plot(mytree)
plot of chunk parttree

plot of chunk parttree

R is a powerful platform

Can Standardize and Share / Compare Results

Code collaboration example

Some code sharing exists

Race/Ethnicity Example

##   sid school_year race_ethnicity
## 1   1        2004              B
## 2   1        2005              H
## 3   1        2006              H
## 4   1        2007              H
## 5   2        2006              W
## 6   2        2007              B

What business rules do we use?

What to do?

head(stuatt, 4)
##   sid school_year race_ethnicity
## 1   1        2004              B
## 2   1        2005              H
## 3   1        2006              H
## 4   1        2007              H

Fix the data

stuatt$race2 <- stuatt$race_ethnicity
stuatt$race2[[1]] <- "H"
head(stuatt, 4)
##   sid school_year race_ethnicity race2
## 1   1        2004              B     H
## 2   1        2005              H     H
## 3   1        2006              H     H
## 4   1        2007              H     H
statamode <- function(x) {
    z <- table(as.vector(x))
    m <- names(z)[z == max(z)]
    if (length(m) == 1) {
        return(m)
    }
    return(".")
}

Fixing data in a few simple steps

require(plyr)
modes <- ddply(stuatt, .(sid), summarize, race_temp = statamode(race_ethnicity), 
    nvals = length(unique(race_ethnicity)), most_recent_year = max(school_year), 
    most_recent_race = tail(race_ethnicity, 1))

modes$race2[modes$race_temp != "."] <- modes$race_temp[modes$race_temp != 
    "."]

modes$race2[modes$race_temp == "."] <- as.character(modes$most_recent_race[modes$race_temp == 
    "."])

stuatt2 <- merge(stuatt, modes)
stuatt2$race_ethnicity <- stuatt2$race2
stuatt2 <- subset(stuatt2, select = c("sid", "school_year", "race_ethnicity"))

Results

head(stuatt)
##   sid school_year race_ethnicity race2
## 1   1        2004              B     H
## 2   1        2005              H     H
## 3   1        2006              H     H
## 4   1        2007              H     H
## 5   2        2006              W     W
## 6   2        2007              B     B
head(stuatt2)
##   sid school_year race_ethnicity
## 1   1        2004              H
## 2   1        2007              H
## 3   1        2006              H
## 4   1        2005              H
## 5 100        2005              A
## 6 100        2006              A

What happened?

Next Steps

A Data Mining Example

Do analytics on fixed data

# Set aside test set
testset <- sample(unique(student_long$stuid), 190000)
student_long$case <- 0
student_long$case[student_long$stuid %in% testset] <- 1
# Draw a training set of data (random subset of students)
training <- subset(student_long, case == 0)
testing <- subset(student_long, case == 1)

training <- training[, c(3, 6:16, 21, 22, 28, 29, 30)]  # subset vars
trainX <- training[, 1:15]

# Parameters
ctrl <- trainControl(method = "repeatedcv", number = 15, repeats = 5, 
    summaryFunction = defaultSummary)
# Search grid
grid <- expand.grid(.interaction.depth = seq(2, 6, by = 1), .n.trees = seq(200, 
    800, by = 50), .shrinkage = c(0.01, 0.1))
# Boosted tree search
gbmTune <- train(x = trainX, y = training$mathSS, method = "gbm", 
    metric = "RMSE", trControl = ctrl, tuneGrid = grid, verbose = FALSE)
gbmPred <- predict(gbmTune, testing[, names(trainX)])

# svmTune<-train(x=trainX, y=training$mathSS, method='svmLinear',
# tuneLength=3, metric='RMSE', trControl=ctrl)

Machine Learning

# plot(gbmTune)

Predictions

## qplot(testing$mathSS,gbmPred,geom='hex',binwidth=c(10,10))+geom_smooth()+theme_dpi()

Deviance

## qplot(testing$mathSS,testing$mathSS-gbmPred,geom='hex',binwidth=c(10,10))+# geom_smooth()+theme_dpi()

Deviance (II)

# qplot(testing$mathSS-gbmPred,binwidth=7)+theme_dpi()+xlim(c(-60,60))

The best part

How to learn?

Online Tutorials

DPI R Bootcamp

Questions

Discussion

Session Info

This document is produced with knitr version 0.6.3. Here is my session info:

print(sessionInfo(), locale = FALSE)
## R version 2.15.0 (2012-03-30)
## Platform: i386-pc-mingw32/i386 (32-bit)
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] plyr_1.7.1     foreign_0.8-49 partykit_0.1-4 mgcv_1.7-13   
## [5] ggplot2_0.9.1  gridExtra_0.9  knitr_0.6.3   
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.1-1   dichromat_1.2-4    digest_0.5.2      
##  [4] evaluate_0.4.2     formatR_0.5        labeling_0.1      
##  [7] lattice_0.20-6     MASS_7.3-17        Matrix_1.0-6      
## [10] memoise_0.1        munsell_0.3        nlme_3.1-103      
## [13] parser_0.0-15      proto_0.3-9.2      RColorBrewer_1.0-5
## [16] Rcpp_0.9.10        reshape2_1.2.1     scales_0.2.1      
## [19] splines_2.15.0     stringr_0.6        survival_2.36-12  
## [22] tools_2.15.0